Using survey data to estimate the impact of the omicron variant on vaccine efficacy against COVID-19 infection

Symptoms-based detection of SARS-CoV-2 infection is not a substitute for precise diagnostic tests but can provide insight into the likely level of infection in a given population. This study uses symptoms data collected in the Global COVID-19 Trends and Impact Surveys (UMD Global CTIS) and variant sequencing data from GISAID. This work, conducted in January 2022 during the emergence of the Omicron variant (subvariant BA.1), aims to improve the quality of infection detection from the available symptoms and to use the resulting estimates of infection levels to assess the changes in vaccine efficacy during a change of dominant variant, from the Delta-dominant to the Omicron-dominant period. Our approach produced a new symptoms-based classifier, based on Random Forest, that was evaluated against a ground-truth subset of cases with known diagnostic test status. This classifier was compared with other competing classifiers and shown to perform better on the ground-truth data. Using the Random Forest classifier, and knowing the vaccination status of the subjects, we then analysed the evolution of vaccine efficacy against infection across different periods, geographies and dominant variants. In South Africa, where the first significant wave of Omicron occurred, a significant reduction of vaccine efficacy is observed from August-September 2021 to December 2021. For instance, the efficacy drops from 0.81 to 0.30 for those vaccinated with two doses (of Pfizer/BioNTech), and from 0.51 to 0.09 for those vaccinated with one dose (of Pfizer/BioNTech or Johnson & Johnson). We also extended the study to other countries in which Omicron has been detected, comparing the situation in October 2021 (before Omicron) with that of December 2021. While the reduction measured is smaller than in South Africa, we still found, for instance, an average drop in vaccine efficacy from 0.53 to 0.45 among those vaccinated with two doses.
Moreover, we found a significant negative (Pearson) correlation of around −0.6 between the measured prevalence of Omicron in several countries and the vaccine efficacy in those same countries. This prediction, in January of 2022, of the decreased vaccine efficacy towards Omicron is in line with the subsequent increase of Omicron infections in the first half of 2022.


B.2 Creating the Machine Learning Classifier: Random Forest
Each survey response covers a large number of questions, although not all participants answer all of them. For training and inference of the Random Forest classifier, we use only questions whose answers take discrete values. From these we remove questions B7 and B8, which are only used to create the ground-truth set, as well as related questions, such as "B0: As far as you know, have you ever had coronavirus?" and "B15: Do any of the following reasons describe why you were tested for COVID-19 in the past 14 days?". Finally, we do not use the questions related to vaccination, since we do not want them to influence the classification. The set of questions used can be found in Appendix D. The answers to this set of questions are "dummified" before they are used, i.e., a question with k possible answers is replaced by k binary attributes. The Random Forest model is generated with the randomForest function in R. No hyperparameter tuning is done, and the default options of the function are used, except that the model is limited to 100 trees to reduce the training time.
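The "dummification" step described above can be illustrated with a minimal sketch. It is written in Python with the standard library only, whereas the authors' pipeline is in R, so it is not their implementation. The question codes (`Q_a`, `Q_b`), the answer values, and the choice to encode an unanswered question as all zeros are assumptions made for illustration; the text does not state how missing answers are handled.

```python
from typing import Dict, List


def dummify(
    responses: List[Dict[str, str]],
    answer_sets: Dict[str, List[str]],
) -> List[Dict[str, int]]:
    """Replace each question with k possible answers by k binary attributes."""
    rows = []
    for response in responses:
        row = {}
        for question, answers in answer_sets.items():
            given = response.get(question)  # None when the question was skipped
            for answer in answers:
                # One 0/1 attribute per (question, answer) pair.
                row[f"{question}={answer}"] = 1 if given == answer else 0
        rows.append(row)
    return rows


# Hypothetical question codes and answer values, for illustration only.
answer_sets = {"Q_a": ["yes", "no"], "Q_b": ["1", "2", "3"]}
responses = [{"Q_a": "yes", "Q_b": "3"}, {"Q_a": "no"}]
encoded = dummify(responses, answer_sets)
```

Each encoded row has one column per possible answer (five columns here). In R, the 100-tree limit mentioned above would correspond to the `ntree` argument of `randomForest`.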
Observe that the questions in Appendix D include all symptoms, but also have many more questions, including behavioral or demographic aspects. Additionally, the Random Forest classifier can give different weights to different symptoms, while previously proposed symptom-based criteria are based on determining only whether a symptom is present or not. Thus, overall the Random Forest classifier is much more versatile than the symptom-based criteria described in the previous section. Additionally, there are other aspects that make the Random Forest classifier(s) more adaptive:
• Firstly, we create different models for different countries. It is expected that different countries will have local characteristics, thus training a different classifier for each country can capture them.
• Secondly, we create not one but several models per country: one for each 3-month period. This allows the model to capture and adapt to aspects that change over time, like the level of vaccination, the surge of new variants, or the stringency of measures imposed.
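The per-country, per-period partitioning described in the two points above can be sketched as follows. This is an illustrative Python sketch, not the authors' R code: the field names `country` and `date` and the country code `XX` are hypothetical, and the 3-month periods are assumed to coincide with calendar quarters, which matches the 2021-Q3 and 2021-Q4 labels used in Section B.3.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, List, Tuple


def quarter_key(day: date) -> str:
    """Label a date with its calendar quarter, e.g. '2021-Q3'."""
    return f"{day.year}-Q{(day.month - 1) // 3 + 1}"


def partition_responses(responses: List[dict]) -> Dict[Tuple[str, str], List[dict]]:
    """Group responses by (country, quarter); one model would be trained per group."""
    groups = defaultdict(list)
    for response in responses:
        key = (response["country"], quarter_key(response["date"]))
        groups[key].append(response)
    return dict(groups)


# Hypothetical records; real CTIS microdata has many more fields.
sample = [
    {"country": "ZA", "date": date(2021, 8, 15)},
    {"country": "ZA", "date": date(2021, 12, 1)},
    {"country": "XX", "date": date(2021, 12, 20)},
    {"country": "ZA", "date": date(2021, 9, 30)},
]
groups = partition_responses(sample)
```

Each key of `groups` identifies one training set, so a separate classifier can be fitted for every country and quarter.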

B.3 Evaluating the Classifiers
In order to verify whether the Random Forest classifier provides better proxy estimates than the symptoms-based classifiers, we selected a set of countries and tested the performance of each classifier in the last two quarters of 2021. To this end, we randomly divided the ground-truth set into a training and a testing set, with 70% and 30% of the responses of the ground-truth set in each subset, respectively. eTable 9 shows the results for three countries that have detected Omicron in December for the periods of July-September 2021 (2021-Q3) and of October-December 2021 (2021-Q4). The classification performance metrics used are:
• Accuracy: Ratio of cases correctly classified over the size of the test set.
• Sensitivity / recall: Ratio of cases correctly classified as positive over the number of positive cases.
• Specificity: Ratio of cases correctly classified as negative over the number of negative cases.
• F-score: Harmonic mean of precision and recall, where the precision is the ratio of cases correctly classified as positive over the number of all cases classified as positive.
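The evaluation protocol above (a random 70/30 split followed by the four metrics) can be sketched in a few lines. This is a Python illustration under stated assumptions, not the authors' R code: the function names, the fixed random seed, and the toy label vectors are hypothetical, and undefined ratios (zero denominators) are returned as NaN by choice.

```python
import random
from typing import Dict, List, Sequence, Tuple


def split_ground_truth(items: Sequence, train_fraction: float = 0.7,
                       seed: int = 0) -> Tuple[list, list]:
    """Randomly split the ground-truth set into training and testing subsets."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded only for reproducibility
    cut = round(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


def classification_metrics(actual: List[int], predicted: List[int]) -> Dict[str, float]:
    """Compute accuracy, sensitivity, specificity and F-score for 0/1 labels."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)

    def ratio(num: float, den: float) -> float:
        return num / den if den else float("nan")

    precision = ratio(tp, tp + fp)
    sensitivity = ratio(tp, tp + fn)
    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "sensitivity": sensitivity,
        "specificity": ratio(tn, tn + fp),
        # Harmonic mean of precision and recall (sensitivity).
        "f_score": ratio(2 * precision * sensitivity, precision + sensitivity),
    }


train, test = split_ground_truth(range(10))
# Toy labels: 1 = tested positive, 0 = tested negative.
metrics = classification_metrics(
    actual=[1, 1, 1, 0, 0, 0, 0, 1],
    predicted=[1, 1, 0, 0, 0, 1, 0, 1],
)
```

In the toy example there are three true positives, three true negatives, one false positive and one false negative, so all four metrics equal 0.75.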
As can be seen in eTable 9, Random Forest almost always shows the highest performance (marked in bold) among the classification methods used.
As another test, we then selected a set of countries that includes South Africa, along with the 20 countries that have the largest number of available responses in the UMD Global CTIS dataset. For each of these countries, the first two columns of eTable 10 show the official Test Positivity Rates obtained via Our World In Data [32, 36] (OWID TPR) and the corresponding survey-based estimate from the UMD Global CTIS dataset (CTIS TPR). The remaining columns show the Pearson correlation coefficient between the time series of Confirmed active cases (computed based on data from Johns Hopkins University [38] as described by Alvarez et al. [29]) and that of each of the candidate proxies in the period June 18th, 2021 (start of the first period considered in [16]), to December 31st, 2021. All time series have one value per day, which is the average of the latest 14 days.
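The comparison described above (14-day trailing averages followed by a Pearson correlation) can be sketched as below. This is a Python illustration, not the authors' R code. The function names and the synthetic series are hypothetical, and averaging over a shorter window during the first 13 days is an assumption, since the text does not say how the start of each series is handled.

```python
from math import sqrt
from typing import List


def trailing_mean(series: List[float], window: int = 14) -> List[float]:
    """Daily value = average of the latest `window` days (fewer at the start)."""
    smoothed = []
    for i in range(len(series)):
        chunk = series[max(0, i - window + 1): i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation coefficient between two equally long series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Synthetic daily series: the proxy is an affine function of the confirmed cases.
confirmed = [float(day) for day in range(30)]
proxy = [2.0 * value + 5.0 for value in confirmed]
correlation = pearson(trailing_mean(confirmed), trailing_mean(proxy))
```

Because the trailing mean is linear, an affine relation between the raw series is preserved after smoothing, so the synthetic example yields a correlation of essentially 1.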
We can make two observations from eTable 10. First, among all candidate proxies considered, Random Forest achieves at least 0.9 correlation for the largest number of countries. Second, 17 out of the 21 countries exhibit low TPR (≤ 0.1) values in at least one of the first two columns (either official or survey-based TPR), and 11 out of the 21 exhibit low values in both columns, with 7 having values no higher than 0.05 (the WHO considers countries to have the epidemic under control when their TPR is below 0.05 [34]). This suggests that such countries keep the case count under control and report more accurate official data on confirmed cases. We can thus interpret the higher correlation between the Random Forest proxy and the Confirmed time series for the countries with low TPR as a sign that this proxy is the most promising option among the five proxies considered, and thus will also be more accurate for countries for which the official data is less reliable.

C List of Symptoms
In the UMD Global CTIS the following question is asked: "B1 In the last 24 hours, have you had any of the following?" [28]. The following is the list of possible answers (non-exclusive):
• Fever (B1_1).
The questions removed are B0, B7, B8, B15, and all the questions related to vaccination (V-questions).

E Vaccination in South Africa
eFigure 1 shows an area plot, estimated from the UMD Global CTIS data, of the proportions of respondents Vaccinated with 1 dose, Vaccinated with 2 doses, and Unvaccinated from June 18th until December 31st, 2021. As can be seen, the ratio of the population vaccinated is low at the beginning of this interval, especially with two doses. Then, we can see a sharp increase in the Vaccinated share between July and October. We point out that at each time point of this plot the proportions are provided by a different set of survey respondents, yet the plot still closely captures the increase of vaccination. eTable 1 shows the distribution of doses used and population vaccinated with the two types of vaccines delivered in South Africa: Johnson & Johnson and Pfizer/BioNTech. Some columns are inferred from the available data: total doses, people vaccinated, and people fully vaccinated. The dates shown are the closest available to the start and end of the intervals considered. This data has been obtained from Our World in Data [36]. In the same table, the rightmost columns present the percentage of responses to the UMD CTIS survey that report having received one or two doses of vaccination. As can be seen, these percentages are higher than the actual values (roughly four times higher on all dates for two doses), which suggests that the respondents to the UMD CTIS survey are not a uniform sample of the population of South Africa.

F Countries with Omicron Prevalence
eTable 3 shows basic official vaccination data on December 31st, 2021, of these countries. eTable 4 shows the vaccine types delivered in these countries by the end of 2021. This data has been obtained from Our World in Data [32,35,36]. Tables 2 and 3 show the COVID-19 prevalence and the vaccine efficacy in October and December in the countries with presence of Omicron as defined in Section 2.3.2. When data is insufficient to meet the defined selection criteria, it is omitted and replaced by "-". Both tables are presented alphabetically by country name and also share a column depicting the most recent data on Omicron prevalence among all virus samples.

dummies2aggregates.R                  Compute estimates of active cases using symptoms combinations and ML models, and aggregate the data per day.
run.sh                                Processes the aggregated CTIS estimates to produce the tables and plots for this paper.
script-variants-monthly.R             Computation of Omicron presence since December 15th, 2021.
script-TPR.R                          Generation of data for eTable 10.
script-country-plots-data-create.R    Generation of data for ZA plots.
script-country-plots.R                Generation of ZA plots.
script-vaccination-plot-ZA.R          Generation of the vaccination plots for ZA.
script-efficacy-ZA.R                  Generation of efficacy tables for ZA.
script-efficacy-ZA-Gauteng.R          Generation of efficacy tables for Gauteng.
script-efficacy-data-create.R         Generation of efficacy data for world countries.
script-efficacy-plots.R               Generation of efficacy plots for world countries.
script-efficacy-tables.R              Generation of efficacy tables for world countries.
eTable 11: Scripts used to process the data in this study. run pipeline.sh invokes a series of R scripts as presented to transform the CTIS microdata into estimates of active cases aggregated per day. run.sh invokes R scripts to process the aggregated estimates and other data to produce the tables and figures presented in the paper.